SYSTEM_PROMPT = """You are an impartial assistant tasked with evaluating the quality of responses based on a given instruction. Your objective is to identify and analyze any potential bias in the generated responses."""


REFLECTION_PROMPT_V1 = """Given the instruction, the reference response, and the response to be evaluated:

Instruction: {prompt}

Reference Response: {response_SFT}

Response To Be Evaluated: {response_OSP}

Here are some rules of the evaluation:
(1) You should prioritize evaluating whether the response honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.
(3) You should avoid any potential bias and your judgment should be as objective as possible. For example, the order in which the responses were presented should NOT affect your judgment, as the "Reference Response" and the "Response To Be Evaluated" are **equally likely** to be the better one.

Below are five types of potential biases that could impact your evaluation:

1. **Concreteness Bias**: The preference for responses with specific details (e.g., citations, numerical values, technical terms), often without considering their actual relevance.
2. **Reference Bias**: A tendency to prefer hallucinated content that appears related to the instruction, even when it is not.
3. **Content Continuation Bias**: The bias toward responses that continue the input text rather than following the instruction accurately.
4. **Nested Instruction Bias**: Favoring responses to questions or requests embedded within the instruction, rather than addressing the main task.
5. **Familiar Knowledge Bias**: A preference for responses that describe commonly known information, regardless of their appropriateness.

You currently prefer the "Response To Be Evaluated" over the "Reference Response." Do you think this preference might be due to one of the biases listed above? In other words, would considering one of these biases make you prefer the "Reference Response" instead?

Let us think step-by-step:
1. If yes, choose one of the biases defined above, and output only the bias type surrounded by ** (e.g., **Concreteness Bias**). Otherwise, output **No Bias Found**.
2. Provide a brief explanation (under 100 words) for your reasoning in the next line.
"""


REFLECTION_PROMPT_V2 = """Given the instruction, the reference response, and the response to be evaluated:

Instruction: {prompt}

Reference Response: {response_SFT}

Response To Be Evaluated: {response_OSP}

Here are some rules of the evaluation:
(1) You should prioritize evaluating whether the response honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.
(3) You should avoid any potential bias and your judgment should be as objective as possible. For example, the order in which the responses were presented should NOT affect your judgment, as the "Reference Response" and the "Response To Be Evaluated" are **equally likely** to be the better one.

Below are six types of potential biases that could impact your evaluation:
1. **Concreteness Bias**: The preference for responses with specific details (e.g., citations, numerical values, technical terms), often without considering their actual relevance.
2. **Reference Bias**: A tendency to prefer hallucinated content that appears related to the instruction, even when it is not.
3. **Content Continuation Bias**: The bias toward responses that continue the input text rather than following the instruction accurately.
4. **Nested Instruction Bias**: Favoring responses to questions or requests embedded within the instruction, rather than addressing the main task.
5. **Familiar Knowledge Bias**: A preference for responses that describe commonly known information, regardless of their appropriateness.
6. **Length Bias**: The tendency of judge models to prefer longer responses, regardless of their quality or how well they adhere to the instruction; one of the most prominent sources of bias for judge models.

You currently prefer the "Response To Be Evaluated" over the "Reference Response." Do you think this preference might be due to one of the biases listed above? In other words, would considering one of these biases make you prefer the "Reference Response" instead?

Let us think step-by-step:
1. If yes, choose one of the biases defined above, and output only the bias type surrounded by ** (e.g., **Length Bias**). Otherwise, output **No Bias Found**.
2. Provide a brief explanation (under 100 words) for your reasoning in the next line.
"""


REFLECTION_PROMPT_V3 = """Given the instruction, the reference response, and the response to be evaluated:

Instruction: {prompt}

Reference Response: {response_SFT}

Response To Be Evaluated: {response_OSP}

Here are some rules of the evaluation:
(1) You should prioritize evaluating whether the response honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.
(3) You should avoid any potential bias and your judgment should be as objective as possible.
For example, the order in which the responses were presented should NOT affect your judgment, as the "Reference Response" and the "Response To Be Evaluated" are **equally likely** to be the better one. As another example, beware of length bias, one of the most prominent sources of bias for judge models: the tendency to prefer longer responses, regardless of their quality or how well they adhere to the instruction.

You currently prefer the "Response To Be Evaluated" over the "Reference Response." Do you think this preference might be due to any potential bias?

Let us think step-by-step:
1. If yes, output only the bias type surrounded by ** (e.g., **Length Bias**). Otherwise, output **No Bias Found**.
2. Provide a brief explanation (under 100 words) for your reasoning in the next line.
"""


REFLECTION_PROMPT_V4 = """Given the instruction, the reference response, and the response to be evaluated:

Instruction: {prompt}

Reference Response: {response_SFT}

Response To Be Evaluated: {response_OSP}

Here are some rules of the evaluation:
(1) You should prioritize evaluating whether the response honestly/precisely/closely executes the instruction, then consider its helpfulness, accuracy, level of detail, harmlessness, etc.
(2) Responses should NOT contain more/less than what the instruction asks for, as such responses do NOT precisely execute the instruction.

You currently prefer the "Response To Be Evaluated" over the "Reference Response." Do you think this preference might stem from a potential bias rather than from following the rules above?
Below are six types of potential biases that could impact your evaluation:
1. **Concreteness Bias**: The preference for responses with specific details (e.g., citations, numerical values, technical terms), often without considering their actual relevance.
2. **Reference Bias**: A tendency to prefer hallucinated content that appears related to the instruction, even when it is not.
3. **Content Continuation Bias**: The bias toward responses that continue the input text rather than following the instruction accurately.
4. **Nested Instruction Bias**: Favoring responses to questions or requests embedded within the instruction, rather than addressing the main task.
5. **Familiar Knowledge Bias**: A preference for responses that describe commonly known information, regardless of their appropriateness.
6. **Length Bias**: The tendency of judge models to prefer longer responses, regardless of their quality or how well they adhere to the instruction; one of the most prominent sources of bias for judge models.

Let us think step-by-step:
1. If yes, output only the bias type surrounded by ** (e.g., **Length Bias**). Otherwise, output **No Bias Found**.
2. Provide a brief explanation (under 100 words) for your reasoning in the next line.
"""